Intro

NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.

Data

Goal is to use this data to categorize risk of death by day of the week on a county level.

We will start with monthly data and attempt to work into the day of the week. Due to anonymization of data, days of the week are specified but not the actual dates. I.e, All Fridays of the month are bundled into a single Friday variable.

Underlying cause of death 2018-21

  • Data was pulled from the CDC website by grouping by County, Month, and Weekday.
  • Suppressed death counts were added via the CDC to protect privacy, and mask any death values number 1-9.
    • Suppressed values replaced with 5.

Glimpse

Initial imports were performed on Texas and Oklahoma ranging from 2018-21 as these two locations were specifically mentioned by NB as some of their primary areas of focus.

adj_death values are calculated by substracting COVID deaths from in-hospital deaths. crude_rate_10k values are calculated by dividing adj_death values by county_population values and multiplying by 10,000. This gives a mortality rate per 10,000 people per county.

df_wkday %>% glimpse()
## Rows: 127,104
## Columns: 9
## $ year                 <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2…
## $ month                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3…
## $ wkday                <fct> Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ state                <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, O…
## $ county               <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair, …
## $ county_code          <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001, …
## $ county_population    <dbl> 22082, 22082, 22082, 22082, 22082, 22082, 22082, …
## $ wkday_adj_deaths     <dbl> 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1…
## $ wkday_crude_rate_10k <dbl> 0.4528575, 0.4528575, 0.0000000, 0.4528575, 0.000…
df_monthly %>% glimpse()
## Rows: 15,888
## Columns: 8
## $ year                   <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018,…
## $ month                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3,…
## $ date                   <date> 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01…
## $ state                  <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK,…
## $ county                 <fct> Adair, Adair, Adair, Adair, Adair, Adair, Adair…
## $ county_code            <dbl> 40001, 40001, 40001, 40001, 40001, 40001, 40001…
## $ monthly_adj_deaths     <dbl> 16, 20, 15, 14, 1, 11, 10, 17, 1, 11, 1, 17, 1,…
## $ monthly_crude_rate_10k <dbl> 7.2457205, 9.0571506, 6.7928630, 6.3400054, 0.4…
df_annual %>% glimpse()
## Rows: 1,324
## Columns: 7
## $ year                  <fct> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, …
## $ state                 <fct> OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, OK, …
## $ county                <fct> Adair, Alfalfa, Atoka, Beaver, Beckham, Blaine, …
## $ county_code           <dbl> 40001, 40003, 40005, 40007, 40009, 40011, 40013,…
## $ county_population     <dbl> 22082, 5754, 13838, 5319, 21709, 9485, 47192, 28…
## $ annual_adj_deaths     <dbl> 156, 39, 88, 38, 145, 79, 322, 258, 707, 349, 31…
## $ annual_crude_rate_10k <dbl> 70.64577, 67.77894, 63.59300, 71.44200, 66.79257…

Suppressed

Early on in EDA it became evident that some replaced Suppressed values were skewing data in counties with low populations. For example, Loving, TX had just a population of 57, but with 5 deaths the crude rate is at a dramatic 877 deaths per 10k people.

df_annual_5 %>%
  select(state, county, county_population, annual_adj_deaths, annual_crude_rate_10k) %>%
  arrange(-annual_crude_rate_10k) %>% 
  slice(1:100)

Histogram

When Suppressed values are replaced with 5, we also see it’s possible to have negative death rates. For example, if there are 5 months in a year with Suppressed COVID hospital deaths, they are all converted to the value 5. The summarised value would then be 25 COVID hospital deaths. The non-suppressed data shows, however, that the total in-hospital deaths over that same year is 10. Since our adjusted death rate subtracts in hospital deaths from COVID deaths we end up with 10 - 25 = -15 in-hospital, non-COVID deaths. Logically, this doesn’t make sense and we could assume the true suppressed values are between 1 and 2 for each fo those months.

This histogram below illustrates the issue with using a value of 5 for Suppressed values, resulting in negative value as well as dramatically high rates for low population counties.

ggplotly(
  df_annual_5 %>% 
    ggplot(aes(x=annual_crude_rate_10k))+
    geom_histogram(bins=50, fill=NA, color="black")+
    geom_vline(xintercept = 0, color="red")+
    labs(
      title="Distribution of annual crude death rates"
    )
  )

Replace with 1

If we replace values with 1 instead of 5 we see the distribution follows a more normal curve as would be expected. However, we do see that some rates appear to still fall into the negative ranges. These specific examples are again artifacts of our imperfect replacement with 1.

ggplotly(
  df_annual_neg %>%
    ggplot(aes(x=annual_crude_rate_10k))+
    geom_histogram(bins=50, fill=NA, color="black")+
    geom_vline(xintercept = 0, color="red")+
    labs(
      title="Better distribution of annual crude death rates"
    )
  )

Because we don’t want to remove any counties from representation we will simply add 3 to the annual_hospital_deaths values so that the rate remains positive. An ideal solution may have been to perform a more advanced imputation with a glm, but this gets us in the right direction.

The rows in question with negative values.

df_annual %>% 
  filter(annual_crude_rate_10k < 0)

Modifying said rows and displaying the final adjusted calculated variables.

ggplotly(
  df_annual %>% 
      ggplot(aes(x=annual_crude_rate_10k))+
      geom_histogram(bins=50, fill=NA, color="black")+
      # geom_vline(xintercept = 0, color="red")+
      labs(
        title="Final distribution of annual crude death rates"
      )
)

EDA Really

A boxplot shows the outlier clearly.

df_annual %>% 
  ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
  geom_boxplot()+
  facet_wrap(~state, scales='free_y')

If we filter out such low populations areas we may negatively impact certain counties. There are a few options:

  • Keep the data as it is (skewed results)
  • Replace values with 1 instead of 5 (less skewed results?)
  • Use the lower value (1) to prevent heavily skewed results (i.e., Loving, TX)
  • Remove all Suppressed values (negative impact?)
  • Impute missing values (MICE, GLM, etc)

We can see the distributions change subtly for Texas. Also noteworthy is that some adjusted death rates end up negative and may just need reset to a 0 value. \[Adjusted Deaths = In Hospital Deaths - COVID Deaths\] Annual COVID deaths are calculated by summarising monthly COVID deaths. In some cases these are all very low values, and substituting with all 5’s may actually increase the number significantly above the in hospital death count therefore producing a negative number.

g1 <- df_annual %>% 
  filter(county != "Loving") %>% 
  ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
  geom_boxplot(outlier.colour = NA)+
  geom_jitter(alpha=.1, width=.1)+
  facet_wrap(~state, scales='free_y')+
  labs(
    title = "Death rate over time by State",
    subtitle = "without Loving, TX"
    )+
  scale_y_continuous(breaks = seq(0,200,50))

g2 <- df_annual %>% 
  filter(annual_adj_deaths > 10) %>% 
  ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
  geom_boxplot(outlier.colour = NA)+
  geom_jitter(alpha=.1, width=.1)+
  facet_wrap(~state, scales='free_y')+
  labs(
    title = "Death rate over time by State",
    subtitle = "Adjusted deaths > 10 only",
    )+
  scale_y_continuous(breaks = seq(0,200,50))

grid.arrange(g1,g2)

EDA

  • How have death rates changed over time for each state?
    • by year
    • by month-year

Annual

Overall rates seem to have dropped off over time.

df_annual %>% 
  filter(annual_crude_rate_10k < 750) %>% 
  mutate(annual_crude_rate_10k = ifelse(annual_crude_rate_10k < 1, 0, annual_crude_rate_10k)) %>% 
  ggplot(aes(x=year, y=annual_crude_rate_10k, group=year))+
  geom_boxplot()+
  facet_wrap(~state)

Monthly

df_monthly %>% 
  group_by(year, month, state) %>% 
  summarise(crude_mean = mean(monthly_crude_rate_10k)) %>% 
  mutate(date = ymd(paste0(as.character(year), "-", as.character(month), "-01"))) %>% 
  ggplot(aes(x=date, y=crude_mean, group=year))+
  geom_boxplot(outlier.colour = NA)+
  geom_jitter(alpha=.4, width=.04)+
  facet_wrap(~state)

df_monthly %>% 
  filter(state == "OK") %>% 
  filter(monthly_crude_rate_10k < 750) %>% 
  ggplot(aes(x=date, y=monthly_crude_rate_10k, group=county))+
  geom_point()+
  geom_line(aes(group=county))+
  facet_wrap(~county)

Rates over time

Notes

  • get high/med/low ranges for categorical
  • create another model
  • start presentation
  • try to visualize results